In previous courses, and even in previous lectures, we generally assumed for convenience that data was i.i.d.; however, this can be a poor assumption. Many real-life problems involve sequential rather than i.i.d. data. For such data, we make the simplifying assumption that it can be modeled as a first-order Markov chain: $p(x_t \mid x_{1:t-1}) = p(x_t \mid x_{t-1})$.
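As a quick illustration, here is a minimal sketch (the two-state chain, its transition matrix `P`, and all names are toy assumptions, not anything from the course) of sampling a sequence under this first-order assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state first-order Markov chain: P[i, j] = p(x_t = j | x_{t-1} = i).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
init = np.array([0.5, 0.5])  # distribution of x_1

T = 10
x = np.empty(T, dtype=int)
x[0] = rng.choice(2, p=init)
for t in range(1, T):
    # First-order property: x_t depends only on x_{t-1}, not on x_{1:t-2}.
    x[t] = rng.choice(2, p=P[x[t - 1]])
print(x)
```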
However, this assumption is still restrictive: it requires the states of our variables to be fully observed. To relax it, we can use a Hidden Markov Model (HMM). HMMs hide the temporal dependence by keeping it in the unobserved state (no assumption is made on the temporal dependence of the observations). Then, for each observation $x_t$, we have a corresponding unobserved hidden/latent variable $z_t$.
- We have the joint distribution $p(x_{1:T}, z_{1:T}) = p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}) \prod_{t=1}^{T} p(x_t \mid z_t)$.
HMM observations are not limited to a Markov chain of any fixed order; their dependence structure is mediated entirely by the hidden states. For a homogeneous HMM, we have the following:
- Initial Distribution: $\pi(i) = p(z_1 = i)$. The probability of the first hidden variable being in state $i$.
- Transition Distribution: $\Psi(i, j) = p(z_{t+1} = j \mid z_t = i)$, $i, j \in \{1, \dots, k\}$. The probability of moving from hidden state $i$ to hidden state $j$.
- Emission Distribution: $\psi_t(i) = p(x_t \mid z_t = i)$. The probability of an observed random variable $x_t$ given the state of the hidden variable that "emitted" it.
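To make these three distributions concrete, the sketch below (the toy numbers, a discrete emission matrix `B[i, x]` standing in for $p(x_t = x \mid z_t = i)$, and all names are illustrative assumptions) encodes them as arrays and evaluates the joint factorization given above:

```python
import numpy as np

# Homogeneous HMM with k = 2 hidden states and 3 discrete observation symbols.
pi = np.array([0.6, 0.4])        # initial: pi[i] = p(z_1 = i)
Psi = np.array([[0.7, 0.3],      # transition: Psi[i, j] = p(z_{t+1} = j | z_t = i)
                [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],   # emission: B[i, x] = p(x_t = x | z_t = i)
              [0.1, 0.3, 0.6]])

def joint_prob(x, z):
    """p(x_{1:T}, z_{1:T}) = p(z_1) * prod_t p(z_t | z_{t-1}) * prod_t p(x_t | z_t)."""
    p = pi[z[0]] * B[z[0], x[0]]
    for t in range(1, len(x)):
        p *= Psi[z[t - 1], z[t]] * B[z[t], x[t]]
    return p

print(joint_prob(x=[0, 2, 1], z=[0, 1, 1]))
```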
Forward-Backward Algorithm
We want to compute the probability of a latent sequence given an observation sequence, i.e. $p(z_{1:t} \mid x_{1:t})$. We can use the Forward-Backward Algorithm to solve this task.
This task of hidden state inference breaks down into several subtasks:
- Filtering: compute the posterior over the current hidden state, $p(z_t \mid x_{1:t})$.
- Prediction: compute the posterior over a future hidden state, $p(z_{t+k} \mid x_{1:t})$ (see the sketch after this list).
- Smoothing: compute the posterior over a past hidden state, $p(z_k \mid x_{1:t})$, $1 < k < t$.
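Of the three, prediction is the simplest to sketch: once the filtered posterior $\alpha_t$ is available (computed by the forward recursion below), we push it through the transition matrix $k$ times. A minimal sketch under that assumption (the function name and toy numbers are ours):

```python
import numpy as np

def predict(alpha_t, Psi, k):
    """p(z_{t+k} | x_{1:t}): push the filtered posterior through the transition k times."""
    for _ in range(k):
        alpha_t = alpha_t @ Psi  # (alpha @ Psi)[j] = sum_i alpha[i] * p(z' = j | z = i)
    return alpha_t

# e.g. a toy 2-state transition matrix and a filtered posterior at time t:
Psi = np.array([[0.7, 0.3], [0.4, 0.6]])
print(predict(np.array([0.9, 0.1]), Psi, k=3))
```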
We can use the Forward Recursion to compute $p(z_t \mid x_{1:t})$ and the Backward Recursion to compute $p(x_{t+1:T} \mid z_t)$. That is,
$$\gamma_t(i) = p(z_t = i \mid x_{1:T}) \propto p(z_t = i, x_{1:T}) = p(z_t = i, x_{1:t})\, p(x_{t+1:T} \mid z_t = i, x_{1:t}) = p(z_t = i, x_{1:t})\, p(x_{t+1:T} \mid z_t = i) \propto \alpha_t(i)\, \beta_t(i),$$
since $x_{t+1:T} \perp x_{1:t} \mid z_t$.
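In code, combining the two messages is just a pointwise product followed by normalization over states. A sketch assuming the forward messages $\alpha$ (from the recursion below) and backward messages $\beta$ (derived later) are already available as $(T, k)$ arrays:

```python
import numpy as np

def smooth(alpha, beta):
    """gamma[t, i] = p(z_t = i | x_{1:T}), proportional to alpha[t, i] * beta[t, i]."""
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)  # normalize over states at each t
```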
Forward Recursion
We first define $\alpha_t(j) = p(z_t = j \mid x_{1:t})$, $t \in [1, T]$, and assume that $p(z_1)$, $p(z_t \mid z_{t-1})$, and $p(x_t \mid z_t)$ are known. Then we can compute $\alpha_t(j)$ recursively as follows:
- (Prediction step) Compute the one-step-ahead predictive density (i.e. the new prior for time $t$):
  $p(z_t = j \mid x_{1:t-1}) = \sum_i p(z_t = j \mid z_{t-1} = i)\, p(z_{t-1} = i \mid x_{1:t-1}) = \sum_i \Psi(i, j)\, \alpha_{t-1}(i)$.
- (Update step) Absorb the observation at time $t$ via Bayes' rule:
  $\alpha_t(j) = p(z_t = j \mid x_{1:t}) \propto p(x_t \mid z_t = j)\, p(z_t = j \mid x_{1:t-1}) = \psi_t(j) \sum_i \Psi(i, j)\, \alpha_{t-1}(i)$.
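Putting the prediction and update steps together gives the full forward (filtering) recursion; below is a sketch with per-step normalization (the normalization bookkeeping and names are our own; the toy parameters match the earlier sketch):

```python
import numpy as np

def forward(x, pi, Psi, B):
    """Filtering: alpha[t, j] = p(z_t = j | x_{1:t}), computed left to right."""
    T, k = len(x), len(pi)
    alpha = np.zeros((T, k))
    # Base case: p(z_1 = j | x_1) is proportional to p(x_1 | z_1 = j) p(z_1 = j).
    alpha[0] = pi * B[:, x[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        pred = alpha[t - 1] @ Psi      # prediction step: p(z_t = j | x_{1:t-1})
        alpha[t] = B[:, x[t]] * pred   # update step: multiply in the evidence psi_t(j)
        alpha[t] /= alpha[t].sum()     # normalize so each alpha[t] is a distribution
    return alpha

pi = np.array([0.6, 0.4])
Psi = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(forward([0, 2, 1], pi, Psi, B)[-1])  # filtered posterior over the last state
```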